Skip to content

[RORDEV-1944] investigating socket hangs up issue#82

Closed
Dzuming wants to merge 12 commits into
masterfrom
fix-tenancy-switch
Closed

[RORDEV-1944] investigating socket hangs up issue#82
Dzuming wants to merge 12 commits into
masterfrom
fix-tenancy-switch

Conversation

@Dzuming
Copy link
Copy Markdown
Collaborator

@Dzuming Dzuming commented Apr 30, 2026

Summary by CodeRabbit

  • Tests

    • Improved E2E test synchronization for greater stability.
    • Added Elasticsearch connectivity monitoring during E2E runs to surface cluster and environment connectivity issues.
  • Chores

    • Integrated the connectivity monitor into automated test workflows to ensure it starts/stops with test jobs for better visibility and troubleshooting.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 30, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a Cypress synchronization helper and integrates a new GitHub composite action that can start/stop a background Elasticsearch connectivity monitor (host, Docker container, Kubernetes pod) into E2E workflow jobs.

Changes

Tenancy Test Synchronization

Layer / File(s) Summary
Page Object Helper
e2e-tests/cypress/support/page-objects/IndexManagement.ts
Adds static waitingForSectionLoadingFinish() which logs "Waiting for section loading" and asserts sectionLoading test subject does not exist within a 30s timeout.
Test Integration
e2e-tests/cypress/e2e/Tenancy.cy.ts
Imports IndexManagement and calls IndexManagement.waitingForSectionLoadingFinish() after login initialization and before executing the post-login callback to synchronize the test.

ES Connectivity Monitoring Infrastructure

Layer / File(s) Summary
Monitoring Action
.github/es-connectivity-monitor/action.yml
New composite GitHub Action with action input (start/stop) plus es_url, es_user, es_password inputs. start spawns a detached loop (every 5s) that logs host-side /_cluster/health via curl, performs TCP reachability checks from a detected local kbn Docker container and from a discovered Kibana Kubernetes pod (resolving an ES service name), tails recent container logs, writes output to /tmp/es-connectivity-monitor.log, and saves the background PID to /tmp/es-connectivity-monitor.pid. stop attempts to kill the PID and prints the monitor log.
Workflow Integration
.github/workflows/trigger-e2e-tests.yml
Adds "Start ES connectivity monitor" steps (after Docker memory monitor) to master-e2e-tests, master-bootstrap-tests, and develop-e2e-tests jobs, plus corresponding if: always() "Stop ES connectivity monitor" steps to ensure the monitor is stopped.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Runner as "GH Actions runner"
    participant Host as "Host (curl)"
    participant Docker as "Docker kbn container"
    participant KubePod as "Kibana pod (k8s)"
    participant ES as "Elasticsearch"

    Runner->>Host: start es-connectivity-monitor (composite action)
    Host->>Host: spawn background loop (every 5s) -> log to /tmp/es-connectivity-monitor.log
    Host->>ES: curl inputs.es_url/_cluster/health (record status/timing)
    Host->>Docker: exec TCP probe to es-ror:9200 and tail container logs
    Host->>KubePod: kubectl exec TCP probe to resolved ES service:9200
    Note right of Host: loop repeats until stopped
    Runner->>Host: stop es-connectivity-monitor (always)
    Host->>Host: kill PID (best-effort), print /tmp/es-connectivity-monitor.log
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 I hopped through CI and test-time light,
Waiting for sections till they readjust right,
I listen to ES, from host, container, and pod,
Logging the pulses that keep tests from odd,
A tiny rabbit watcher — steady and bright.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ⚠️ Warning The PR title references a ticket (RORDEV-1944) about investigating socket hangs, but the actual changes focus on adding synchronization waits for index management section loading and monitoring Elasticsearch connectivity. Align the title with the main changes: consider 'Add index management section loading synchronization and ES connectivity monitoring' or similar to accurately reflect the implemented functionality.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-tenancy-switch

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@e2e-tests/cypress/support/page-objects/IndexManagement.ts`:
- Around line 113-114: The current spinner waits assume the loader will both
appear and then be hidden, which flaps when loading has already finished or the
element is removed; replace the two strict gets with a resilient check against
the '[data-test-subj="sectionLoading"]' element: attempt to get it with a
timeout, then branch on its existence/visibility so the helper passes if the
loader is already absent, if it appears then waits until it is not
visible/removed, and if it is present but hidden immediately treat that as
success; target the selector '[data-test-subj="sectionLoading"]' when
implementing this logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4dacbc3c-dec7-4a95-943f-11b47c2c9076

📥 Commits

Reviewing files that changed from the base of the PR and between f4c12e5 and 851f4eb.

📒 Files selected for processing (2)
  • e2e-tests/cypress/e2e/Tenancy.cy.ts
  • e2e-tests/cypress/support/page-objects/IndexManagement.ts

Comment thread e2e-tests/cypress/support/page-objects/IndexManagement.ts Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
.github/workflows/trigger-e2e-tests.yml (2)

64-68: ES connectivity monitor log is not included in S3 artifact uploads.

The "Stop ES connectivity monitor" step prints the log to stdout only. The subsequent "S3 Upload Videos & show logs" step (on failure) uploads videos and Docker logs, but the ES connectivity log at /tmp/es-connectivity-monitor.log is not explicitly included. If you want to correlate ES connectivity loss with test failures, consider having .github/upload-videos (or a dedicated step) pick up /tmp/es-connectivity-monitor.log.

Also applies to: 125-129, 186-190

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/trigger-e2e-tests.yml around lines 64 - 68, The ES
connectivity monitor's log (/tmp/es-connectivity-monitor.log) is only printed to
stdout in the "Stop ES connectivity monitor" step and isn't included in the
artifacts; update the upload step (the "S3 Upload Videos & show logs" step or
the reusable action .github/upload-videos) to explicitly collect and upload
/tmp/es-connectivity-monitor.log (or add a dedicated step after the "Stop ES
connectivity monitor" step to upload that file to S3/artifacts), ensuring the
file path is added to the list of uploaded logs so ES connectivity traces are
available alongside videos and Docker logs.

39-42: Consider explicitly passing es_url, es_user, and es_password to make configuration explicit rather than relying on action defaults.

The workflow currently relies on the monitor's default values (https://localhost:9200, kibana/kibana credentials), which do match the port mappings in both Docker and ECK test environments. However, explicitly passing these values would improve clarity and reduce fragility if future test infrastructure changes the port or credentials. This is not currently causing issues, but makes the configuration harder to audit.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/trigger-e2e-tests.yml around lines 39 - 42, The ES
connectivity monitor step ("Start ES connectivity monitor" using
./.github/es-connectivity-monitor) currently relies on the action's defaults;
update that step to explicitly pass the es_url, es_user, and es_password inputs
(e.g., set with: es_url: "http://localhost:9200", es_user: "kibana",
es_password: "kibana" or the appropriate test creds) so configuration is
explicit and not dependent on action defaults; ensure the input names match the
action's input parameters and keep the existing with: action: start entry.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/es-connectivity-monitor/action.yml:
- Around line 33-35: The curl run step currently interpolates inputs with `${{
inputs.es_user }}`, `${{ inputs.es_password }}` and `${{ inputs.es_url }}`
directly into the shell command (the `run:` block) which is unsafe; update the
step to bind each input to environment variables (e.g., ES_USER, ES_PASSWORD,
ES_URL) via the step's `env:` and then reference them inside the `run:` script
as `$ES_USER`, `$ES_PASSWORD`, and `$ES_URL`; apply the same change to the other
similar curl steps (those using the same `${{ inputs.* }}` patterns on lines
noted) to avoid command injection and accidental secret leakage.
- Line 45: The workflow currently hardcodes the Docker Elasticsearch target URL
("https://es-ror:9200/_cluster/health?pretty") which isn't configurable; add a
new optional input (e.g. es_docker_host with default "es-ror") to action.yml and
replace the hardcoded host in the probe step with the constructed URL using that
input (e.g. https://${{ inputs.es_docker_host }}:9200/_cluster/health?pretty) so
the Docker probe becomes configurable and consistent with the existing es_url
input; ensure input name matches casing used elsewhere and update any references
to es_ror/es-ror to the new input name.

---

Nitpick comments:
In @.github/workflows/trigger-e2e-tests.yml:
- Around line 64-68: The ES connectivity monitor's log
(/tmp/es-connectivity-monitor.log) is only printed to stdout in the "Stop ES
connectivity monitor" step and isn't included in the artifacts; update the
upload step (the "S3 Upload Videos & show logs" step or the reusable action
.github/upload-videos) to explicitly collect and upload
/tmp/es-connectivity-monitor.log (or add a dedicated step after the "Stop ES
connectivity monitor" step to upload that file to S3/artifacts), ensuring the
file path is added to the list of uploaded logs so ES connectivity traces are
available alongside videos and Docker logs.
- Around line 39-42: The ES connectivity monitor step ("Start ES connectivity
monitor" using ./.github/es-connectivity-monitor) currently relies on the
action's defaults; update that step to explicitly pass the es_url, es_user, and
es_password inputs (e.g., set with: es_url: "http://localhost:9200", es_user:
"kibana", es_password: "kibana" or the appropriate test creds) so configuration
is explicit and not dependent on action defaults; ensure the input names match
the action's input parameters and keep the existing with: action: start entry.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 015c56f8-d984-43b7-9067-2264c7bbafc4

📥 Commits

Reviewing files that changed from the base of the PR and between 4421f33 and 5b908d0.

📒 Files selected for processing (2)
  • .github/es-connectivity-monitor/action.yml
  • .github/workflows/trigger-e2e-tests.yml

Comment on lines +33 to +35
curl -sk -u "${{ inputs.es_user }}:${{ inputs.es_password }}" \
"${{ inputs.es_url }}/_cluster/health?pretty" \
-o /tmp/es-health.json -w "HTTP %{http_code} in %{time_total}s\n" 2>&1 || true
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid direct ${{ inputs.* }} interpolation into shell run: blocks — use env: vars instead.

${{ }} expressions in composite-action run: blocks are handled via variable interpolation before shell execution, so embedding them directly is vulnerable to command injection if a caller ever passes a value containing shell metacharacters.

Additionally, the raw input values are written into the runner's temporary shell script on disk before execution, meaning es_password can appear in runner debug traces.

The safe pattern — per GitHub's own composite-action documentation — is to bind inputs to env: variables on each step and reference them as $ENV_VAR inside the shell script.

🔒 Proposed fix — bind credentials via env:
     - name: Start monitor
       if: inputs.action == 'start'
       shell: bash
+      env:
+        ES_USER: ${{ inputs.es_user }}
+        ES_PASSWORD: ${{ inputs.es_password }}
+        ES_URL: ${{ inputs.es_url }}
       run: |
         (
           while true; do
             echo "===== $(date -Is) ====="

             echo "== ES connectivity from host =="
-            curl -sk -u "${{ inputs.es_user }}:${{ inputs.es_password }}" \
-              "${{ inputs.es_url }}/_cluster/health?pretty" \
+            curl -sk -u "${ES_USER}:${ES_PASSWORD}" \
+              "${ES_URL}/_cluster/health?pretty" \
               -o /tmp/es-health.json -w "HTTP %{http_code} in %{time_total}s\n" 2>&1 || true
             cat /tmp/es-health.json 2>/dev/null || true

             echo
             echo "== ES connectivity from KBN Docker container =="
             KBN_CONTAINER=$(docker ps --filter "name=kbn" --format "{{.Names}}" 2>/dev/null | head -1)
             if [ -n "$KBN_CONTAINER" ]; then
               echo "Container: $KBN_CONTAINER"
               docker exec "$KBN_CONTAINER" \
-                curl -sk -u "${{ inputs.es_user }}:${{ inputs.es_password }}" \
+                curl -sk -u "${ES_USER}:${ES_PASSWORD}" \
                 https://es-ror:9200/_cluster/health?pretty \
                 -w "HTTP %{http_code} in %{time_total}s\n" 2>&1 || true
             ...
               if [ -n "$ES_SVC" ]; then
                 kubectl exec -n "$KBN_NS" "$KBN_NAME" -- \
-                  curl -sk -u "${{ inputs.es_user }}:${{ inputs.es_password }}" \
+                  curl -sk -u "${ES_USER}:${ES_PASSWORD}" \
                   "https://${ES_SVC}:9200/_cluster/health?pretty" \

Also applies to: 44-46, 64-66

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/es-connectivity-monitor/action.yml around lines 33 - 35, The curl
run step currently interpolates inputs with `${{ inputs.es_user }}`, `${{
inputs.es_password }}` and `${{ inputs.es_url }}` directly into the shell
command (the `run:` block) which is unsafe; update the step to bind each input
to environment variables (e.g., ES_USER, ES_PASSWORD, ES_URL) via the step's
`env:` and then reference them inside the `run:` script as `$ES_USER`,
`$ES_PASSWORD`, and `$ES_URL`; apply the same change to the other similar curl
steps (those using the same `${{ inputs.* }}` patterns on lines noted) to avoid
command injection and accidental secret leakage.

echo "Container: $KBN_CONTAINER"
docker exec "$KBN_CONTAINER" \
curl -sk -u "${{ inputs.es_user }}:${{ inputs.es_password }}" \
https://es-ror:9200/_cluster/health?pretty \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Hardcoded Docker ES hostname es-ror:9200 is not configurable.

The Docker probe target (https://es-ror:9200) is baked in while the host probe target is already an input (es_url). If the internal Docker service name ever changes, this probe silently becomes a no-op (swallowed by || true). Consider exposing it as an optional input (e.g. es_docker_host, defaulting to es-ror) for consistency and future-proofing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/es-connectivity-monitor/action.yml at line 45, The workflow
currently hardcodes the Docker Elasticsearch target URL
("https://es-ror:9200/_cluster/health?pretty") which isn't configurable; add a
new optional input (e.g. es_docker_host with default "es-ror") to action.yml and
replace the hardcoded host in the probe step with the constructed URL using that
input (e.g. https://${{ inputs.es_docker_host }}:9200/_cluster/health?pretty) so
the Docker probe becomes configurable and consistent with the existing es_url
input; ensure input name matches casing used elsewhere and update any references
to es_ror/es-ror to the new input name.

@coutoPL coutoPL changed the title waiting for index management section loading [RORDEV-1944] investigating socket hangs up issue May 18, 2026
@coutoPL
Copy link
Copy Markdown
Collaborator

coutoPL commented May 18, 2026

Nothing to merge here. The fix was tested and works well

@coutoPL coutoPL closed this May 18, 2026
@coutoPL coutoPL deleted the fix-tenancy-switch branch May 18, 2026 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants